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Abstract 


In logistic regression, separation occurs when a linear combination of the predictors 
can perfectly classify part or all of the observations in the sample, and as a result. 


finite maximum likelihood estimates of the regression coefficients do not exist. Gelman 


et al. (2008) recommended independent Cauchy distributions as default priors for the 


regression coefficients in logistic regression, even in the case of separation, and reported 
posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a 
natural question is whether the posterior means of the regression coefficients exist under 
separation. We prove theorems that provide necessary and sufficient conditions for the 
existence of posterior means under independent Cauchy priors for the logit link and a 
general family of link functions, including the probit link. We also study the existence 
of posterior means under multivariate Cauchy priors. For full Bayesian inference, we 
develop a Cibbs sampler based on Polya-Camma data augmentation to sample from the 
posterior distribution under independent Student-f priors including Cauchy priors, and 
provide a companion R package in the supplement. We demonstrate empirically that 
even when the posterior means of the regression coefficients exist under separation, 
the magnitude of the posterior samples for Cauchy priors may be unusually large, 
and the corresponding Cibbs sampler shows extremely slow mixing. While alternative 
algorithms such as the No-U-Turn Sampler in Stan can greatly improve mixing, in 
order to resolve the issue of extremely heavy tailed posteriors for Cauchy priors under 
separation, one would need to consider lighter tailed priors such as normal priors or 
Student-f priors with degrees of freedom larger than one. 
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1 Introduction 


In Bayesian linear regression, the choice of prior distribution for the regression coefficients 
is a key component of the analysis. Noninformative priors are convenient when the analyst 
does not have much prior information, but these prior distributions are often improper 


which can lead to improper posterior distributions in certain situations. Fernandez and 


Steel (2000) investigated the propriety of the posterior distribution and the existence of 


posterior moments of regression and scale parameters for a linear regression model, with 
errors distributed as scale mixtures of normals, under the independence Jeffreys prior. For 
a design matrix of full column rank, they showed that posterior propriety holds under mild 
conditions on the sample size; however, the existence of posterior moments is affected by the 
design matrix and the mixing distribution. Further, there is not always a unique choice of 
noninformat ive prior ( Yang and Bergei]|1996 ). On the other hand, proper prior distributions 
for the regression coefficients guarantee the propriety of posterior distributions. Among 
them, normal priors are commonly used in normal linear regression models, as conjugacy 
permits efficient posterior computation. The normal priors are informative because the 
prior mean and covariance can be specihed to reflect the analyst’s prior information, and 
the posterior mean of the regression coefficients is the weighted average of the maximum 
likelihood estimator and the prior mean, with the weight on the latter decreasing as the 
prior variance increases. 

A natural alternative to the normal prior is the Student-f prior distribution, which can 
be viewed as a scale mixture of normals. The Student-f prior has tails heavier than the 
normal prior, and hence is more appealing in the case where weakly informative priors are 
desirable. The Student-f prior is considered robust, because when it is used for location 


parameters, outliers have vanishing influence on posterior distributions (Dawid 1973). The 
Cauchy distribution is a special case of the Student-f distribution with 1 degree of freedom. 
It has been recommended as a prior for normal mean parameters in a point null hypothesis 
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testing (Jeffreys 1961), because if the observations are overwhelmingly far from zero (the 
value of the mean specihed under the point null hypothesis), the Bayes factor favoring the 
alternative hypothesis tends to inhnity. Multivariate Cauchy priors have also been proposed 


for regression coefficients (Zellner and Siow 1980). 

While the choice of prior distributions has been extensively studied for normal linear 
regression, there has been comparatively less work for generalized linear models. Propriety of 
the posterior distribution and the existence of posterior moments for binary response models 
under different noninformative prior choices have been considered (Ibrahim and Laud [19^ 


Chen and Shao 2001). 


Regression models for binary response variables may suffer from a particular problem 
known as separation, which is the focus of this paper. For example, complete separation oc¬ 
curs if there exists a linear function of the covariates for which positive values of the function 
correspond to those units with response values of 1, while negative values of the function 


correspond to units with response values of 0. Formal dehnitions of separation (Albert and 


Anderson||1984), including complete separation and its closely related counterpart quasicom- 


plete separation, are reviewed in Section 2. Separation is not a rare problem in practice, 
and has the potential to become increasingly common in the era of big data, with analysis 
often being made on data with a modest sample size but a large number of covariates. When 


separation is present in the data, Albert and Anderson (1984) showed that the maximum 
likelihood estimates (MLEs) of the regression coefficients do not exist (i.e., are inhnite). Re¬ 
moving certain covariates from the regression model may appear to be an easy remedy for 
the problem of separation, but this ad-hoc strategy has been shown to often result in the 


removal of covariates with strong relationships with the response (Zorn 2005). 

In the frequentist literature, various solutions based on penalized or modihed likelihoods 


have been proposed to obtain hnite parameter estimates (Firth 1993 Heinze and Schemper 


2002 Heinzej2006| Rousseeuw and Christmann 2003). The problem has also been noted when 


htting Bayesian logistic regression models (Clogg et ah 1991), where posterior inferences 
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would be similarly affected by the problem of separation if using improper priors, with the 


possibility of improper posterior distributions (Speckman et ah 2009). 


Gelman et ah (2008) recommended using independent Cauchy prior distributions as a 


default weakly informative choice for the regression coefficients in a logistic regression model, 
because these heavy tailed priors avoid over-shrinking large coefficients, but provide shrinkage 
(unlike improper uniform priors) that enables inferences even in the presence of complete 


separation. Gelman et al. (2008) developed an approximate EM algorithm to obtain the 
posterior mode of regression coefficients with Cauchy priors. While inferences based on the 
posterior mode are convenient, often other summaries of the posterior distribution are also of 
interest. For example, posterior means under Cauchy priors estimated via Monte Carlo and 


other approximations have been reported in Bardenet et al. (2014); Chopin and Ridgway 


(2015). It is well-known that the mean does not exist for the Cauchy distribution, so clearly 
the prior means of the regression coefficients do not exist. In the presence of separation, 
where the maximum likelihood estimates are not hnite, it is not clear whether the posterior 
means will exist. To the best of our knowledge, there has been no investigation considering 
the existence of the posterior mean under Cauchy priors and our research is hlling this gap. 
We hnd a necessary and sufficient condition where the use of independent Cauchy priors will 
result in hnite posterior means here. In doing so we provide further theoretical underpinning 


of the approach recommended by Gelman et al. (2008), and additionally provide further 
insights on their suggestion of centering the covariates before htting the regression model, 
which can have an impact on the existence of posterior means. 

When the conditions for existence of the posterior mean are satished, we also empirically 
compare different prior choices (including the Cauchy prior) through various simulated and 
real data examples. In general, posterior computation for logistic regression is known to be 
more challenging than probit regression. Several MCMC algorithms for logistic regression 
have been proposed (O’Brien and Dunson|2004 Holmes and Held|2006 Gramacy and Poison 


2012 

), while the most recent Polya-Gamma data augmentation scheme of 

Poison et al. 

(2013 
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emerged superior to the other methods. Thus we extend this Polya-Gamma Gibbs sampler 
for normal priors to accommodate independent Student-t priors and provide an R package 
to implement the corresponding Gibbs sampler. 


The remainder of this article is organized as follows. In Section 2 we derive the theo¬ 
retical results: a necessary and sufficient condition for the existence of posterior means for 
coefficients under independent Gauchy priors in a logistic regression model in the presence 
of separation, and extend our investigation to binary regression models with other link func¬ 
tions such as the probit link, and multivariate Gauchy priors. In Section 3 we develop a 
Gibbs sampler for the logistic regression model under independent Student-t prior distri¬ 
butions (of which the Gauchy distribution is a special case) and briefly describe the NUTS 


algorithm of Hoffman and Gelman (2014) which forms the basis of the software Stan. In 
Section 4 we illustrate via simulated data that Gauchy priors may lead to coefficients of 
extremely large magnitude under separation, accompanied by slow mixing Gibbs samplers, 
compared to lighter tailed priors such as Student-t priors with degrees of freedom 7 (fy) or 
normal priors. In Section 5 we compare Gauchy, ty, and normal priors based on two real 
datasets, the SPEGT data with quasicomplete separation and the Pima Indian Diabetes data 
without separation. Overall, Gauchy priors exhibit slow mixing under the Gibbs sampler 
compared to the other two priors. Although mixing can be improved by the NUTS algorithm 
in Stan, normal priors seem to be the most preferable in terms of producing more reason¬ 
able scales for posterior samples of the regression coefficients accompanied by competitive 
predictive performance, under separation. In Section 6 we conclude with a discussion and 
our recommendations. 


2 Existence of Posterior Means Under Cauchy Priors 

In this section, we begin with a review of the concepts of complete and quasicomplete 


separation proposed by Albert and Anderson (1984). Then based on a new concept of solitary 
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separators, we introduce the main theoretical result of this paper, a necessary and sufficient 
condition for the existence of posterior means of regression coefficients under independent 
Cauchy priors in the case of separation. Finally, we extend our investigation to binary 
regression models with other link functions, and Cauchy priors with different scale parameter 
structures. 

Let y = (?/i, j/ 2 , ... , i/n)^ denote a vector of independent Bernoulli response variables 
with success probabilities 7ri,7r2,... ,7r„. For each of the observations, i = 1,2,... ,n, let 
Xj = (xji, Xi 2 , ■ ■ ■, Xip)"^ denote a vector of p covariates, whose hrst component is assumed to 
accommodate the intercept, i.e., Xi^i = 1. Let X denote the n x p design matrix with xf 
as its fth row. We assume that the column rank of X is greater than 1. In this paper, we 
mainly focus on the logistic regression model, which is expressed as: 

= xf/3, f = l,2, ...,n, (1) 

where /3 = (/9i, /I 2 , • • ■, is the vector of regression coefficients. 


2.1 A Brief Review of Separation 

We denote two disjoint subsets of sample points based on their response values: Aq = {i : 
Pi = 0} and Ai = {i Di = 1}. According to the dehnition of Albert and Anderson (1984[), 


complete separation occurs in the sample if there exists a vector a. = (ai,a 2 , ■ ■ ■, ctp)'^, such 
that for alH = 1, 2,..., n. 


xfo > 0 if i G Ai, xfo < 0 if i G Aq. 


( 2 ) 


Consider a simple example in which we wish to predict whether subjects in a study have 
a certain kind of infection based on model Q. Let j/j = 1 if the Ah subject is infected 
and 0 otherwise. The model includes an intercept [xn = 1) and the other covariates are 


6 










age ( 3 :^ 2 ), gender {xis), and previous records of being infected {xi 4 ). Suppose in the sample, 
all infected subjects are older than 25 (xi 2 > 25), and all subjects who are not infected 
are younger than 25 {xi 2 < 25). This is an example of complete separation because (|^ is 
satisfied for ck = (—25,1, 0, 0)^. 


If the sample points cannot be completely separated, Albert and Anderson (1984) intro¬ 
duced another notion of separation called quasicomplete separation. There is quasicomplete 
separation in the sample if there exists a non-null vector ck = (ai, 02 ,..., ctp)^, such that 
for alH = 1, 2,..., n. 


x[ct >0 if i G Ai, xf ct <0 if i G Aq, 


( 3 ) 


and equality holds for at least one i. Consider the set up of the previous example where the 
goal is to predict whether a person is infected or not. Suppose we have the same model but 
there is a slight modihcation in the dataset: all infected subjects are at least 25 years old 
{xi 2 > 25), all uninfected subjects are no more than 25 years old {xi 2 < 25), and there are 
two subjects aged exactly 25, of whom one is infected but not the other. This is an example 
of quasicomplete separation because ([^ is satisfied for a = (—25,1, 0, 0)^ and the equality 
holds for two observations with age exactly 25. 

Let C and Q denote the set of all vectors a that satisfy (|^ and (§. respectively. For 
any ck G C, all sample points must satisfy (|^, so ck cannot lead to quasicomplete separation 
which requires at least one equality in ([^. This implies that C and Q are disjoint sets, while 


both can be non-empty for a certain dataset. Note that Albert and Anderson (1984) define 


quasicomplete separation only when the sample points cannot be separated using complete 
separation. Thus according to their definition, only one of C and Q can be non-empty for a 
certain dataset. However, in our slightly modihed definition of quasicomplete separation, the 
absence of complete separation is not required. This permits both C and Q to be non-empty 
for a dataset. In the remainder of the paper, for simplicity we use the term “separation” to 
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refer to either complete or quasicomplete separation, so that C U Q is non-empty. 


2.2 Existence of Posterior Means Under Independent Cauchy Pri¬ 


ors 


When Markov chain Monte Carlo (MCMC) is applied to sample from the posterior distribn- 
tion, the posterior mean is a commonly used summary statistic. We aim to study whether 
the marginal posterior mean E{(3j \ y) exists under the independent Cauchy priors suggested 


by Gelman et ah (2008). Let C{jjL^a) denote a Cauchy distribution with location parameter 


yU and scale parameter a. The default prior suggested by Gelman et ah (2008) corresponds 


to Pj G(0, (Tj), for j = 1,2 ,...,p. 

For a design matrix with full column rank, Albert and Anderson ( |1984 ) showed that a 
hnite maximum likelihood estimate of f3 does not exist when there is separation in the data. 
However, even in the case of separation and/or a rank dehcient design matrix, the posterior 
means for some or all /d^-’s may exist because they incorporate the information from the 


prior distribution. Following Dehnition 2.2.1 of Gasella and Berger (1990, pp. 55), we say 
E{(3j I y) exists if E{\(3j\ | y) < cx), and in this case, E{(3j \ y) is given by 


EWj I y) = / Pj p{Pj I y) dPj + / p{Pj I y) dPj 


(4) 


Note that alternative dehnitions may require only one of the integrals in (|^ to be hnite 
for the mean to exist, e.g., Bickel and Doksum ( 2001| pp. 455). However, according to the 
dehnition used in this paper, both integrals in (|^ have to be hnite for the posterior mean to 
exist. Our main result shows that for each j = 1, 2,... ,p, the existence of E{(3j \ y) depends 
on whether the predictor Xj is a solitary separator or not, which is dehned as follows: 


Definition 1. The predictor is a solitary separator, if there exists an a E {C U Q) such 




















that 


aj 7 ^ 0, Or = 0 for all r ^ j. 


( 5 ) 


This definition implies that for a solitary separator Xj, if aj > 0, then Xij > 0 for all 
i G Ai, and Xij < 0 for all i G Aq] if aj < 0, then Xtj < 0 for all i G Ai, and Xij > 0 for 
all i G Aq. Therefore, the hyperplane {x G : Xj = 0} in the predictor space separates 
the data into two groups Ai and Aq (except for the points located on the hyperplane). The 
following theorem provides a necessary and sufficient condition for the existence of marginal 
posterior means of regression coefficients in a logistic regression model. 

Theorem 1 . In a logistic regression model, suppose the regression coefficients (including the 
intercept) ftj (7(0, aj) with aj > 0 for j = 1,2, ... ,p, so that 


p{(3)=YiPiPj )=n 

i=i i=i 


1 


( 6 ) 


Then for each j = 1,2,... ,p, the posterior mean E{l3j \ y) exists if and only if^j is not a 
solitary separator. 


A proof of Theorem is available in Appendices and 

Remark 1. Theorem\^ implies that under independent Cauchy priors in logistic regression, 
the posterior means of all coefficients exist if there is no separation, or if there is separation 
with no solitary separators. 


Remark 2. Gelman et al. (2008) suggested centering all predictors (except interaction 
terms) in the pre-proeessing step. A conseguence of Theorem^is that centering may have a 
crucial role in the existence of the posterior mean E{/3j \ y). 


We expand on the second remark with a toy example where a predictor is a solitary 
separator before centering but not after centering. Consider a dataset with n = 100, y = 
(0,... 0,1,..., 1)^ and a binary predictor Xj = (0,... 0,1,..., 1)^. Here X^ is a solitary 
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separator which leads to quasicomplete separation before centering. However, the centered 
predictor = (—0.5, • • • — 0.5,0.5,..., 0.5)^ is no longer a solitary separator because after 

50 50 

centering the hyperplane {x ; Xj = —0.5} separates the data but {x : Xj = 0} does not. 
Consequently, the posterior mean \ y) does not exist before centering but it exists after 
centering. 


2.3 Extensions of the Theoretical Result 


So far we have mainly focused on the logistic regression model, which is one of the most widely 
used binary regression models because of the interpretability of its regression coefficients in 
terms of odds ratios. We now generalize Theorem to binary regression models with link 
functions other than the logit. Following the definition in McCullagh and Nelder| (1989, pp. 
27), we assume that for i = 1,2,..., n, the linear predictor xf/3 and the success probability 
TTj are connected by a monotonic and differentiable link function g[-) such that gi^i) = xf/3. 
We further assume that g{.) is a one-to-one function, which means that g{.) is strictly 
monotonic. This is satisfied by many commonly used link functions including the probit. 
Without loss of generality, we assume that g{-) is a strictly increasing function. 


Theorem 2. In a binary regression model with link function g{.) described above, suppose the 
regression coefficients have independent Cauchy priors in 0 - Then for each j = 1,2,... ,p, 

(1) a necessary condition for the existence of the posterior mean E{(3j \ y) is that Xj is 
not a solitary separator; 

(2) a sufficient condition for the existence of E[(3j \ y) consists of the following: 

(i) Xj is not a solitary separator, and 
(a) Ve > 0, 



PjPWj)9 \-^Pj)dPj < oo. 



PjPiPj) [1 


g ^(e/3j)] dl3j < oo. 


(7) 
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Note that Q in the sufficient condition of Theorem imposes constraints on the link 
function g{.), and hence the likelihood function. A proof of this theorem is given in Appendix 
Moreover, it is shown that condition Q holds for the probit link function. 

In certain applications, to incorporate available prior information, it may be desirable to 
use Cauchy priors with nonzero location parameters. The following corollary states that for 
both logistic and probit regression, the condition for existence of posterior means derived in 
Theorems and 1^ continues to hold under independent Cauchy priors with nonzero location 
parameters. 


Corollary 1. In logistic and probit regression models, suppose the regression coefficients 
/3j C{fij,aj), for j = 1,2,... ,p. Then a necessary and sufficient condition for the exis¬ 
tence of the posterior mean E{(3j \ y) is that Xj is not a solitary separator, forj = 1, 2,... ,p. 


A proof of Corollary is available in Appendix 

In some applications it could be more natural to allow the regression coefficients to be 
dependent, a priori. Thus in addition to independent Cauchy priors, we also study the 
existence of posterior means under a multivariate Cauchy prior, with the following density 
function: 


p(/3) 


rm 


|)!rl|S|5 [1 + (/3 -m)’’S 


1+p 


( 8 ) 


where (3 G is a p x 1 location parameter and S is a p x p positive-dehnite scale 


matrix. A special case of the multivariate Cauchy prior is the Zellner-Siow prior (Zellner 


and Siow 1980). It can be viewed as a scale mixture of p-priors, where conditional on g, (3 
has a multivariate normal prior with a covariance matrix proportional to p(X^X)“^, and 
the hyperparameter g has an inverse gamma prior, IG(l/2, n/2). Based on generalizations of 
the p-prior to binary regression models (|Fouskakis et ah 2009 Sabanes Bove and Held [ 


Hanson et ah [2014), the Zeller-Siow prior, which has a density ([^ with S cx n(X^X) 


can 


be a desirable objective prior as it preserves the covariance structure of the data and is free 
of tuning parameters. 
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Theorem 3. In logistic and probit regression models, suppose the vector of regression co¬ 
efficients [3 has a multivariate Cauchy prior as in ([^. If there is no separation, then all 
posterior means E{(3j \ y) exist, for j = l,2,...,p. If there is complete separation, then 
none of the posterior means E{/3j \ y) exist, for j = 1,2,... ,p. 

A proof of Theorem is available in Appendices and The study of existence of 
posterior means under multivariate Cauchy priors in the presence of quasicomplete separation 
has proved to be more challenging. We hope to study this problem in future work. Note 
that although under (|^, the induced marginal prior of (Ij is a univariate Cauchy distribution 
for each j = 1,2,... ,p, the multivariate Cauchy prior is different from independent Cauchy 
priors, even with a diagonal scale matrix S = diag((j^, erf,... ,o'p). In fact, as a rotation 
invariant distribution, the multivariate Cauchy prior places less probability mass along axes 
than the independent Cauchy priors (see Figure [^. Therefore, it is not surprising that 
solitary separators no longer play an important role for existence of posterior means under 
multivariate Cauchy priors, as evident from Theorem 


Independent Cauchy Bivariate Cauchy 




Figure 1: Contour plots of log-density functions of independent Cauchy distributions with 
both scale parameters being 1 (left) and a bivariate Cauchy distribution with scale matrix 
I 2 (right). These plots suggest that independent Cauchy priors place more probability mass 
along axes than a multivariate Cauchy prior, and thus impose stronger shrinkage. Hence, if 
complete separation occurs, E{IIj \ Y) may exist under independent Cauchy priors for some 
or all j = l,2,...,p (Theorem [^, but does not exist under a multivariate Cauchy prior 
(Theorem]^. 
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So far we have considered Cauchy priors, which are t distributions with 1 degree of 
freedom. We close this section with a remark on lighter tailed t priors (with degrees of 
freedom greater than 1) and normal priors, for which the prior means exist. 

Remark 3. In a binary regression model, suppose that the regression coefficients have in¬ 
dependent Student-t priors with degrees of freedom greater than one, or independent normal 
priors. Then it is straightforward to show that the posterior means of the coefficients exist 
because the likelihood is bounded above by one and the prior means exist. The same result 
holds under multivariate t priors with degrees of freedom greater than one, and multivariate 
normal priors. 


3 MCMC Sampling for Logistic Regression 


In this section we discuss two algorithms for sampling from the posterior distribution for 
logistic regression coefficients under independent Student-t priors. We hrst develop a Gibbs 
sampler and then briefly describe the No-U-Turn Sampler (NUTS) implemented in the freely 
available software Stan (Carpenter et ah] 2016). 


3.1 Polya-Gamma Data Augmentation Gibbs Sampler 


Poison et ah (2013) showed that the likelihood for logistic regression can be written as a 


mixture of normals with respect to a Polya-Gamma (PC) distribution. Based on this result, 
they developed an efficient Gibbs sampler for logistic regression with a multivariate normal 


prior on /3. Choi and Robert (2013) showed that their Gibbs sampler is uniformly ergodic. 


This guarantees the existence of central limit theorems for Monte Carlo averages of functions 


of /3 which are square integrable with respect to the posterior distribution p{(3 \ y). Choi 


and Robert (2013) developed a latent data model which also led to the Gibbs sampler of 


Poison et ah (2013). We adopt their latent data formulation to develop a Gibbs sampler for 


logistic regression with independent Student-t priors on (3. 
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Let U = { 2 / 7 ^“^) Wi/{21 — 1)^, where hhi, W 2 ,... is a sequence of i.i.d. Exponential 

random variables with rate parameters equal to 1. The density of U is given by 


h{u) = 


00 

«=o 






0 < M < CX). 


TTW-' 


( 9 ) 


Then for /c > 0, the Polya-Gamma (PG) distribution is constructed by exponential tilting 
of h{u) as follows: 


p(m; k) = cosh 



0 < u < cx). 


( 10 ) 


A random variable with density p(u; k) has a PG(1, k) distribution. 

Let f„(0, aj) denote the Student-t distribution with v degrees of freedom, location param¬ 
eter 0, and scale parameter aj. Since Student-t distributions can be expressed as inverse- 
gamma (IG) scale mixtures of normal distributions, for j = 1, 2,... ,p, we have: 


/3j ~ ty{0,aj 


13j I ~ N(0,7j), 


7,~IG 


Gonditional on (3 and P = diag( 7 i, 72 ,..., 7p), let {vi, zi), ( 1 / 2 , 2 ^ 2 ),..., (l/n, Zn) be n in¬ 
dependent random vectors such that Ui has a Bernoulli distribution with success proba¬ 
bility exp(xf/3)/(l -|- exp(xf/3)), Zi ~ PG(1, |xf/3|), and Ui and Zi are independent, for 
i = 1,2, ...,n. Let Zj:, = diag(zi, Z 2 , ■ ■ ■, Zn), then the augmented posterior density is 
p(P,r,ZD I y). We develop a Gibbs sampler with target distribution p(/3, T^Zr, \ y), which 
cycles through the following sequence of distributions iteratively: 


1 . /3 I r, Zz,, y ~ N ((X^ZzjX + (X^Z^X + , where yi = yi - 1/2 

andy = {yi,y 2 ,... ,ynV, 

2. ^J/3,Z,,,y'~"lGf^,^),forj = l,2,...,p, 


3. 2 :* I r, /3, y PG(1, |xf/3|), for i = 1,2,..., n. 
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Steps 1 and 3 follow immediately from Choi and Robert (2013); Poison et al. (2013) and 


step 2 follows from straightforward algebra. In the next section, for comparison of posterior 
distribntions nnder Stndent-t priors with different degrees of freedom, we implement the 


above Gibbs sampler, and for normal priors we apply the Gibbs sampler of Poison et ah 


(2013). Both Gibbs samplers can be implemented nsing the R package tglm, available in 


the snpplement. 


3.2 Stan 

Onr empirical resnlts in the next section snggest that the Gibbs sampler exhibits extremely 
slow mixing for posterior simulation under Gauchy priors for data with separation. Thus 
we consider alternative MGMG sampling algorithms in the hope of improving mixing. A 
random walk Metropolis algorithm shows some improvement over the Gibbs sampler in the 
p = 2 case. However, it is not efficient for exploring higher dimensional spaces. Thus we 
have been motivated to use the software Stan (Garpenter et al.||2016), which implements the 


No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014), a tuning free extension of the 


Hamiltonian Monte Garlo (HMG) algorithm (Neal 2011) 


It has been demonstrated that for continuous parameter spaces, HMG can improve over 
poorly mixing Gibbs samplers and random walk Metropolis algorithms. HMG is a Metropo¬ 
lis algorithm that generates proposals based on Hamiltonian dynamics, a concept borrowed 
from Physics. In HMG, the parameter of interest is referred to as the “position” variable, 
representing a particle’s position in a p-dimensional space. A p-dimensional auxiliary pa¬ 
rameter, the “momentum” variable, is introduced to represent the particle’s momentum. In 
each iteration, the momentum variable is generated from a Gaussian distribution, and then 
a proposal of the position momentum pair is generated (approximately) along the trajectory 
of the Hamiltonian dynamics dehned by the joint distribution of the position and momen¬ 
tum. Hamiltonian dynamics changing over time can be approximated by discretizing time 
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via the “leapfrog” method. In practice, a proposal is generated by applying the leapfrog al¬ 
gorithm L times, with stepsize e, to the the current state. The proposed state is accepted or 
rejected according to a Metropolis acceptance probability. Section 5.3.3 of the review paper 


by Neal (2011) illustrates the practical benehts of HMC over random walk Metropolis algo¬ 
rithms. The examples in this section demonstrate that the momentum variable may change 
only slowly along certain directions during leapfrog steps, permitting the position variable 
to move consistently in this direction for many steps. In this way, proposed states using 
Hamiltonian dynamics can be far away from current states but still achieve high acceptance 
probabilities, making HMC more efficient than traditional algorithms such as random walk 
Metropolis. 

In spite of its advantages, HMC has not been very widely used in the Statistics com¬ 
munity until recently, because its performance can be sensitive to the choice of two tuning 
parameters: the leapfrog stepsize e and the number of leapfrog steps L. Very small e can lead 
to waste in computational power whereas large e can yield large errors due to discretization. 
Regarding the number of leapfrog steps L, if it is too small, proposed states can be near 
current states and thus resemble random walk. On the other hand, if L is too large, the 
Hamiltonian trajectory can retrace its path so that the proposal is brought closer to the 
current value, which again is a waste of computational power. 

The NUTS algorithm tunes these two parameters automatically. To select L, the main 
idea is to run the leapfrog steps until the trajectory starts to retrace its path. More specif¬ 
ically, NUTS builds a binary tree based on a recursive doubling procedure, that is similar 


in flavor to the doubling procedure used for slice sampling by Neal (2003), with nodes of 
the tree representing position momentum pairs visited by the leapfrog steps along the path. 
The doubling procedure is stopped if the trajectory starts retracing its path, that is making 
a “U-turn”, or if there is a large simulation error accumulated due to many steps of leapfrog 
discretization. NUTS consists of a carefully constructed transition kernel that leaves the 
target joint distribution invariant. It also proposes a way for adaptive tuning of the stepsize 
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e. 


We find that by implementing this tuning free NUTS algorithm, available in the freely 
available software Stan, substantially better mixing than the Gibbs sampler can be achieved 
in all of our examples in which posterior means exist. We still include the Gibbs sampler 
in this article for two main reasons. First, it illustrates that Stan can provide an incredible 
improvement in mixing over the Gibbs sampler in certain cases. Stan requires minimal 
coding effort, much less than developing a Gibbs sampler, which may be useful information 
for readers who are not yet familiar with Stan. Second, Stan currently works for continuous 
target distributions only, but discrete distributions for models and mixed distributions for 
regression coefficients frequently arise in Bayesian variable selection, for regression models 


with binary or categorical response variables (Holmes and Held 2006 Mitra and Dunson 


2010 Ghosh and Glyd^|2011 Ghosh et al.||2011 Ghosh and Reiter||2013 ; Li and Glyde||2015 ). 
Unlike HMG algorithms, Gibbs samplers can typically be extended via data augmentation 
to incorporate mixtures of a point mass and a continuous distribution, as priors for the 
regression coefficients, without much additional effort. 


4 Simulated Data 

In this section, we use two simulation examples to empirically demonstrate that under in¬ 
dependent Gauchy priors, the aforementioned MGMG algorithm for logistic regression may 
suffer from extremely slow mixing in the presence of separation in the dataset. 

For each simulation scenario, we hrst standardize the predictors following the recommen¬ 


dation of Gelman et ah (2008). Binary predictors (with 0/1 denoting the two categories) 


are centered to have mean 0, and other predictors are centered and scaled to have mean 0 
and standard deviation 0.5. Their rationale is that such standardizing makes the scale of a 
continuous predictor comparable to that of a symmetric binary predictor, in the sense that 


they have the same sample mean and sample standard deviation. Gelman et ah (2008) made 
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a distinction between input variables and predictors, and they suggested standardizing the 
input variables only. For example, temperature and humidity may be input variables as well 
as predictors in a model; however, their interaction term is a predictor but not an input 
variable. In our examples, except for the constant term for the intercept, all other predictors 
are input variables and standardized appropriately. 

We compare the posterior distributions under independent i) Cauchy, i.e., Student-f with 
1 degree of freedom, ii) Student-f with 7 degrees of freedom (tj), and iii) normal priors for the 
regression coefficients. In binary regression models, while the inverse cumulative distribution 
function (CDF) of the logistic distribution yields the logit link function, the inverse CDF 


of the Student-t distribution yields the robit link function. Liu (2004) showed that the 
logistic link can be well approximated by a robit link with 7 degrees of freedom. So a fr 
prior approximately matches the tail heaviness of the logistic likelihood underlying logistic 


regression. For Cauchy priors we use the default choice recommended by Gelman et al. 


(2008): all location parameters are set to 0 and scale parameters are set to 10 and 2.5 for the 
intercept and other coefficients, respectively. To be consistent we use the same location and 


scale parameters for the other two priors. Gelman et al. (2008) adopted a similar strategy 
in one of their analyses, to study the effect of tail heaviness of the priors. Among the priors 
considered here, the normal prior has the lightest tails, the Gauchy prior the heaviest, and 
the t-j prior offers a compromise between the two extremes. For each simulated dataset, we 


run both the Gibbs sampler developed in Section [3T] and Stan, for 1,000,000 iterations after 
a burn-in of 100,000 samples, under each of the three priors. 


4.1 Complete Separation with a Solitary Separator 

First, we generate a dataset with p = 2 (including the intercept) and n = 30. The contin¬ 
uous predictor X 2 is chosen to be a solitary separator (after standardizing), which leads to 
complete separation, whereas the constant term Xi contains all one’s and is not a solitary 
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separator. A plot of y versus X 2 in Figure [^demonstrates this graphically. So by Theorem 
under independent Cauchy priors, E^fii \ y) exists but E{P 2 \ y) does not. 
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Figure 2: Scatter plot of y versus X 2 in the hrst simulated dataset, where X 2 is a solitary 
separator which completely separates the samples (the vertical line at zero separates the 
points corresponding to ?/ = 1 and y = 0). 

The results from the Gibbs sampler are reported in Figures [^ and [^ Figure [^ shows 
the posterior samples of (3 under the different priors. The scale of [32, the coefficient corre¬ 
sponding to the solitary separator X 2 , is extremely large under Cauchy priors, less so under 
priors, and the smallest under normal priors. In particular, under Cauchy priors, the 
posterior distribution of [32 seems to have an extremely long right tail. Moreover, although 
Xi is not a solitary separator, under Cauchy priors, the posterior samples of [3i have a much 
larger spread. Figure [^ shows that the running means of both [3i and [32 converge rapidly 
under normal and ty priors, whereas under Cauchy priors, the running mean of [3i does not 
converge after a million iterations and that of [32 clearly diverges. We also ran Stan for 
this example but do not report the results here, because it gave warning messages about 
divergent transitions for Cauchy priors, after the burn-in period. Given that the posterior 
mean of [32 does not exist in this case, the lack of convergence is not surprising. 
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Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of (3i and {32 from 
the Gibbs sampler, under independent Cauchy, ^ 7 , and normal priors for the hrst simulated 
dataset. 

4.2 Complete Separation Without Solitary Separators 

Now we generate a new dataset with p = 2 and n = 30 such that there is complete separation 
but there are no solitary separators (see Figure]^. This guarantees the existence of both 
E{(3i I y) and E{(32 \ y) under independent Cauchy priors. The difference in the existence 
of E{f32 I y) for the two simulated datasets is reflected by the posterior samples from the 
Gibbs sampler: under Cauchy priors, the samples of (32 in Figure 1 in the Appendix are 
more stabilized than those in Figure in the manuscript. However, when comparing across 
prior distributions, we hnd that the posterior samples of neither f3i nor (32 are as stable as 
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Figure 4: Plots of running means of /3i (top) and [32 (bottom) sampled from the posterior 
distributions via the Gibbs sampler, under independent Cauchy, tf, and normal priors for 
the hrst simulated dataset. Here E{(3i \ y) exists under independent Cauchy priors but 
E[I32 I y) does not. 

those under t-j and normal priors, which is not surprising because among the three priors, 
Cauchy priors have the heaviest tails and thus yield the least shrinkage. Figure 2 in the 
Appendix shows that the convergence of the running means under Cauchy priors is slow. 
Although we have not verified the existence of the second or higher order posterior moments 
under Cauchy priors, for exploratory purposes we examine sample autocorrelation plots of 
the draws from the Gibbs sampler. Figure shows that the autocorrelation decays extremely 
slowly for Cauchy priors, reasonably fast for t-j priors, and rapidly for normal priors. 
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Some results from Stan are reported in Figures 3 and 4 in the Appendix. Figure 3 in 
the Appendix shows posterior distributions with nearly identical shapes as those obtained 
using Gibbs sampling in Figure 1 in the Appendix, with the only difference being that more 
extreme values appear under Stan. This is most likely due to faster mixing in Stan. As Stan 
traverses the parameter space more rapidly, values in the tails appear more quickly than 
under the Gibbs sampler. Figures 2 and 4 in the Appendix demonstrate that running means 
based on Stan are in good agreement with those based on the Gibbs sampler. 

The autocorrelation plots for Stan in Figure demonstrate a remarkable improvement 
over those for Gibbs in Figure for all priors, and the difference in mixing is the most 
prominent for Gauchy priors. 

To summarize, all the plots unequivocally suggest that Gauchy priors lead to an extremely 
slow mixing Gibbs sampler and unusually large scales for the regression coefficients, even 
when all the marginal posterior means are guaranteed to exist. While mixing can be improved 
tremendously with Stan, the heavy tailed posteriors under Stan are in agreement with those 
obtained from the Gibbs samplers. One may argue that in spite of the unnaturally large 
regression coefficients, Gauchy priors could lead to superior predictions. Thus in the next 
two sections we compare predictions based on posteriors under the three priors for two real 
datasets. As Stan generates nearly independent samples, we use Stan for MGMG simulations 
for the real datasets. 


5 Real Data 


5.1 SPECT Dataset 


The “SPEGT” dataset (Kurgan et al. 2001) is available from the UGl Machine Learning 


Repositorj|^ The binary response variable is whether a patient’s cardiac image is normal or 
abnormal, according to the diagnosis of cardiologists. The predictors are 22 binary features 


ffittps://archive.ics.uci.edu/ml/datasets/SPECT+Heart 
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X2 


Figure 5: Scatter plot of y versus X 2 for the second simulated dataset. The solid vertical 
line at —0.3 demonstrates complete separation of the samples. However, X 2 is not a solitary 
separator, because the dashed vertical line at zero does not separate the points corresponding 
to ?/ = 1 and ?/ = 0. The other predictor Xi is a vector of ones corresponding to the intercept, 
which is not a solitary separator, either. 


obtained from the cardiac images using a machine learning algorithm. The goal of the study 
is to determine if the predictors can correctly predict the diagnoses of cardiologists, so that 
the process could be automated to some extent. 

Prior to centering, two of the binary predictors are solitary quasicomplete separators: 
Xjj = 0 Vi G Hq and Xij > 0 Vi G Hi, for j = 18,19, with Xi denoting the column 


of ones. Ghosh and Reiter (2013) analyzed this dataset with a Bayesian probit regression 
model which incorporated variable selection. As some of their proposed methods relied on 
an approximation of the marginal likelihood based on the MLE of f3, they had to drop these 
potentially important predictors from the analysis. If one analyzed the dataset with the 
uncentered predictors, by Theorem]^ the posterior means E{{3is I y) and E{{3ig \ y) would 
not exist under independent Cauchy priors. However, after centering there are no solitary 
separators, so the posterior means of all coefficients exist. 

The SPECT dataset is split into a training set of 80 observations and a test set of 187 
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Figure 6: Autocorrelation plots of the posterior samples of /3i (top) and /?2 (bottom) from the 
Gibbs sampler, under independent Cauchy, G, and normal priors for the second simulated 
dataset. 


observations by Kurgan et ah (2001). We use the former for model fitting and the latter for 
prediction. First, for each of the three priors (Cauchy, fy, and normal), we run Stan on the 
training dataset, for 1,000,000 iterations after discarding 100,000 samples as burn-in. 

As in the simulation study, MCMC draws from Stan show excellent mixing for all priors. 
However, the posterior means of the regression coefficients involved in separation are rather 
large under Cauchy priors compared to the other priors. For example, the posterior means of 
(/5 i8 ,/3 i9 ) under Cauchy, ty, and normal priors are (10.02,5.57), (3.24,1.68), and (2.73,1.43) 
respectively. These results suggest that Cauchy priors are too diffuse for datasets with 
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Figure 7: Autocorrelation plots of the posterior samples of (3i (top) and (32 (bottom) from 
Stan, under independent Cauchy, ^ 7 , and normal priors for the second simulated dataset. 


separation. 

Next for each i = 1,2,... ,ntest in the test set, we estimate the corresponding success 
probability vr* by the Monte Carlo average: 


-MC 

VT,- 




S' ^ 1 + ’ 

S = 1 


( 11 ) 


where is the sampled value of (3 in iteration s, after burn-in. Recall that here ntest = 187 
and S = 10®. We calculate two different types of summary measures to assess predictive 
performance. We classify the ith observation in the test set as a success, if > 0.5 and as 
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a failure otherwise, and compute the misclassihcation rates. Note that the misclassihcation 
rate does not fully take into account the magnitude of For example, ii Ui = 1 both 

^Mc _ g_g -^Quld correctly classify the observation, while the latter may be 
more preferable. So we also consider the average squared difference between yi and vff''": 


h^test 


MSE^^ = —y 

ntest ^ 


TT, 


MC 


Vi) , 


( 12 ) 


i=l 


which is always between 0 and 1, with a value closer to 0 being more preferable. Note that 
the Brier score (Brier 1950) equals 2MSE^^, according to its original dehnition. Since in 


some modihed dehnitions ( |Blattenberger and Lad||l985 ), it is the same as MSE^^, we refer 
to MSE^^ as the Brier score. 


Table 1: Misclassihcation rates based on and under Cauchy, tj, and normal priors 
for the SPECT data. Small values are preferable. 



Gauchy 

tj 

normal 

MGMG 

0.273 

0.257 

0.251 

EM 

0.278 

0.262 

0.262 


Table 2: Brier scores MSE^^ and MSE^^, under Cauchy, tj, and normal priors for the 
SPECT data. Small values are preferable. 



Gauchy 

tj 

normal 

MGMG 

0.172 

0.165 

0.163 

EM 

0.179 

0.178 

0.178 


To compare the Monte Carlo estimates with those based on the EM algorithm of Gelman 


et ah 


(2008), we also estimate the posterior mode, denoted by /3 under identical priors and 


hyperparameters, using the R package arm (Gelman et ah 2015). The EM estimator of TTj is 
given by: 

(13) 


--EM 
TT,- = 




1 q. gx/ (3 


and MS'S™ is calculated by replacing by in (12). 


We report the misclassihcation rates in Table and the Brier 


scores 


in Table ^ MGMG 


26 























achieves somewhat smaller misclassihcation rates and Brier scores than EM, especially nnder 
t-j and normal priors. This snggests that a fnll Bayesian analysis using MCMC may produce 
estimates that are closer to the truth than modal estimates based on the EM algorithm. 
The predictions are similar across the three prior distributions with the normal and priors 
yielding slightly more accurate results than Cauchy priors. 

5.2 Pima Indians Diabetes Dataset 

We now analyze the “Pima Indians Diabetes” dataset in the R package MASS. This is a classic 
dataset without separation that has been analyzed by many authors in the past. Using this 
dataset we aim to compare predictions under different priors, when there is no separation. 
Using the training data provided in the package we predict the class labels of the test data. 
In this case the difference between different priors is practically nil. The Brier scores are 
same up to three decimal places, across all priors and all methods (EM and MCMC). The 
misclassification rates reported in Table also show negligible difference between priors 
and methods. Here Cauchy priors have a slightly better misclassification rate compared to 
normal and priors, and MCMC provides slightly more accurate results compared to those 
obtained from EM. These results suggest that when there is no separation and maximum 
likelihood estimates exist, Cauchy priors may be preferable as default weakly informative 
priors in the absence of real prior information. 

Table 3: Misclassification rates based on and under Cauchy, fy, and normal priors 
for the Pima Indians data. Small values are preferable. 



Cauchy 

h 

normal 

MCMC 

0.196 

0.199 

0.199 

EM 

0.202 

0.202 

0.202 
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6 Discussion 


We have proved that posterior means of regression coefficients in logistic regression are not 


always guaranteed to exist under the independent Cauchy priors recommended by Gelman 


et al. (2008), if there is complete or quasicomplete separation in the data. In particular, we 


have introduced the notion of a solitary separator, which is a predictor capable of separating 
the samples on its own. Note that a solitary separator needs to be able to separate without 
the aid of any other predictor, not even the constant term corresponding to the intercept. 
We have proved that for independent Cauchy priors, the absence of solitary separators is 
a necessary condition for the existence of posterior means of all coefficients, for a general 
family of link functions in binary regression models. For logistic and probit regression, this 
has been shown to be a sufficient condition as well. In general, the sufficient condition 
depends on the form of the link function. We have also studied multivariate Cauchy priors, 
where the solitary separator no longer plays an important role. Instead, posterior means of 
all predictors exist if there is no separation, while none of them exist if there is complete 
separation. The result under quasicompelte separation is still unclear and will be studied in 
future work. 

In practice, after centering the input variables it is straightforward to check if there are 
solitary separators in the dataset. The absence of solitary separators guarantees the exis¬ 
tence of posterior means of all regression coefficients in logistic regression under independent 
Cauchy priors. However, our empirical results have shown that even when the posterior 
means for Cauchy priors exist under separation, the posterior samples of the regression co¬ 
efficients may be extremely large in magnitude. Separation is usually considered to be a 
sample phenomenon, so even if the predictors involved in separation are potentially im¬ 
portant, some shrinkage of their coefficients is desirable through the prior. Our empirical 
results based on real datasets have demonstrated that the default Cauchy priors can lead 
to posterior means as large as 10, which is considered to be unusually large on the logit 
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scale. Our impression is that Cauchy priors are good default choices in general because they 
contain weak prior information and let the data speak. However, under separation, when 
there is little information in the data about the logistic regression coefficients (the MLE is 
not hnite), it seems that lighter tailed priors, such as Student-t priors with larger degrees 
of freedom or even normal priors, are more desirable in terms of producing more plausible 
posterior distributions. 

From a computational perspective, we have observed very slow convergence of the Gibbs 
sampler under Cauchy priors in the presence of separation. Note that if the design matrix 
is not of full column rank, for example when p > n, the p columns of X will be linearly 
dependent. This implies that the equation for quasicomplete separation ([^ will be satished 
with equality for all observations. Empirical results (not reported here for brevity) demon¬ 
strated that independent Cauchy priors show convergence of the Gibbs sampler in this case 
also compared to other lighter tailed priors. Out-of-sample predictive performance based on 
a real dataset with separation did not show the default Cauchy priors to be superior to 
or normal priors. 


In logistic regression, under a multivariate normal prior for /3, Choi and Robert (2013) 
showed that the Polya-Gamma data augmentation Gibbs sampler of Poison et ah ( 2013[ ) is 
uniformly ergodic, and the moment generating function of the posterior distribution p{f3 \ y) 
exists for all X,y. In our examples of datasets with separation, the normal priors led to 
the fastest convergence of the Gibbs sampler, reasonable scales for the posterior draws of f3, 
and comparable or even better predictive performance than other priors. The results from 
Stan show no problem in mixing under any of the priors. However, the problematic issue 
of posteriors with extremely heavy tails under Cauchy priors cannot be resolved without 
altering the prior. Thus, after taking into account all the above considerations, for a full 
Bayesian analysis we recommend the use of normal priors as a default, when there is sepa¬ 
ration. Alternatively, heavier tailed priors such as the tj could also be used if robustness is 
a concern. On the other hand, if the goal of the analysis is to obtain point estimates rather 
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than the entire posterior distribntion, the posterior mode obtained from the EM algorithm 


of Gelman et al. (2015) nnder defanlt Canchy priors (Gelman et al. 2008) is a fast viable 


alternative. 


Supplementary Material 

In the snpplementary material, we present additional simnlation resnlts for logistic and 
probit regression with complete separation, along with an appendix that contains the proofs 
of all theoretical resnlts. The Gibbs sampler developed in the paper can be implemented 
with the R package tglm, available from the website: https://cran.r-project.org/web/ 
packages/tglm/index.html. 
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Supplementary Material: On the Use of Cauehy 
Prior Distributions for Bayesian Logistie Regression 


In this supplement, we first present an appendix with additional simulation results for 
logit and probit link functions, and then include an appendix that contains proofs of the 
theoretical results. 


1 Appendix: Simulation Results for Complete Separa¬ 
tion Without Solitary Separators 

In this section we present some supporting figures for the analysis of the simulated dataset 


described in Section 4.2 of the manuscript under logit and probit links. 


1.1 Logistic Regression for Complete Separation Without Solitary 
Separators 

Figures and are based on the posterior samples from a Gibbs sampler under a logit link, 
whereas Figures and are corresponding results from Stan. A detailed discussion of the 


results is provided in Section 4.2 of the manuscript 


1.2 Probit Regression for Complete Separation Without Solitary 
Separators 


In this section we analyze the simulated dataset described in Section 4^ of the manuscript 
under a probit link, while keeping everything else the same. We have shown in Theorem 
that the theoretical results hold for a probit link. The goal of this analysis is to demon¬ 
strate that the empirical results are similar under the logit and probit link functions. For 
this dataset. Theorem guarantees the existence of both E{f3i \ y) and F^(/?2 I y) under 
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Figure 1: Scatter plots (top) and box plots (bottom) of posterior samples of /3i and {32 for 
a logistic regression model, from the Gibbs sampler, under independent Cauchy, G, and 
normal priors for the second simulated dataset. 


independent Cauchy priors and a probit link function. As in the case of logistic regression 
the heavy tails of Cauchy priors translate into an extremely heavy right tail in the posterior 
distributions of {3i and {32, compared to the lighter tailed priors (see Figure and here). 
Thus in the case of separation, normal priors seem to be reasonable for probit regression 
also. 
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Figure 2: Plots of running means of /9i (top) and /32 (bottom) sampled from the posterior dis¬ 
tributions for a logistic regression model, via the Gibbs sampler, under independent Cauchy, 
^ 7 , and normal priors for the second simulated dataset. Here both E[(3i \ y) and E{(32 \ y) 
exist under independent Cauchy priors. 

2 Appendix: Proofs 


First, we decompose the proof of Theorem [T] into two parts: in Appendix]^ we show that a 
necessary condition for the existence of E{l3j \ y) is that is not a solitary separator; and 
in Appendix we show that it is also a sufficient condition. Then, we prove Theorem in 
Appendix and Corollary in Appendix Finally, we decompose the proof of Theorem 
into two parts: in Appendix we show that all posterior means exist if there is no 
separation; then in Appendix we show that none of the posterior means exist if there is 
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Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of (3i and (32 for 
a logistic regression model, from Stan, under independent Cauchy, tj, and normal priors for 
the second simulated dataset. 

complete separation. 


A Proof of the Necessary Condition for Theorem 

Here we show that if Xj is a solitary separator, then E{(3j \ y) does not exist, which is 
equivalent to the necessary condition for Theorem [T} 

Proof. For notational simplicity, we dehne the functional form of the success and failure 
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Figure 4: Plots of running means of /3i (top) and (32 (bottom) sampled from the posterior 
distributions for a logistic regression model, via Stan, under independent Cauchy, t^, and 
normal priors for the second simulated dataset. Here both E{/3i \ y) and E{P 2 \ y) exist 
under independent Cauchy priors. 

probabilities in logistic regression as 

/i(t) = eV(l + e*), /o(t) = 1 —/i(t) = 1/(1 + e*), (A.l) 

which are strictly increasing and decreasing functions of t, respectively. In addition, both 
functions are bounded: 0 < fi(t), fo(t) < 1 for t G M. Let (3{-j) and denote the vectors 
f3 and Xj after excluding their jth entries /3j and Xij, respectively. Then the likelihood 
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Figure 5: Scatter plots (top) and box plots (bottom) of posterior samples of (3i and / 92 , for 
a probit regression model, under Cauchy, ^ 7 , and normal priors for the second simulated 
dataset. Posterior sampling was generated via Stan. 

function can be written as 

n 

p(y I /3) = n /o . (A. 2 ) 

i=l i&Ai keAo 

The posterior mean of (3j exists provided E{\(3j\ | y) < 00 . When the posterior mean exists 
it is given by (|^. Clearly if one of the two integrals in (|^ is not hnite, then E(\/3j\ \ y) = 00 . 
In this proof we will show that if aj > 0, the hrst integral in (|^ equals 00 . Similarly, it can 
be shown that if aj < 0 , the second integral in (|^ equals — 00 . 
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Figure 6: Plots of running means of /3i (top) and ^2 (bottom) sampled from the posterior 
distributions for a probit regression model, under Cauchy, tj, and normal priors for the 
second simulated datatset. Here both E{(3i \ y) and E{(32 \ y) exist under Cauchy priors. 
Posterior sampling was generated via Stan. 

If > 0, by (§-(§, Xij > 0 for all i G Hi, and Xkj < 0 for all k E Aq. When > 0, 
by the monotonic property of /i(t) and /o(f) we have, p(y | (3) > HisAi (<(-h/3(-h) ■ 
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df3j 

(A.3) 


Here the hrst equation results from independent priors, i.e., p{f3(-j) \ (3j) = p{f3(^-j)). Since 
p{y I /3) < 1, p(y) = /iRpP(y I f3)p{f3)d(3 < J^^p{(3)d(3 = 1. Moreover, p(y | (3)p{(3) > 0 
for all /3 G M^, so we also have p(y) > 0, implying that 0 < p(y) < 1. For the independent 
Cauchy priors in ([^, JjPiJj) djdj = oo and the second integral in (A.3) is positive, hence 
(A.3) equals oo. □ 


B Proof of the Sufficient Condition for Theorem [T] 

Here we show that if is not a solitary separator, then the posterior mean of fdj exists. 
Proof. When E{\(3j\ | y) < oo the posterior mean of /3j exists and is given by 

/ OO 

JjPWj I y)d^j 

■oo 

= (djP{(dj)p{y I (dj)d( 5 j j (djP{(dj)p{y I (dj)d[ 5 j. (B.l) 

''---^ ''-V-^ 

denoted by E(/3j|y)+ denoted by E(/3j|y)- 

Since 0 < p(y) < 1 it is enough to show that the positive term E{(]j \ y)+ has a hnite upper 
bound, and the negative term E{(3j \ y)~ has a hnite lower bound. For notational simplicity, 
in the remainder of the proof, we let ct{-j), and cr(_j) denote the vectors a, x*, 

/3, and cr = (ai,..., cXp)^ after excluding their jth entries, respectively. 
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We first show that E{l3j \ y)^ has a finite upper bound. Because Xj is not a 

solitary separator, there exists either 

(a) an i' G Ai such that Xj/j < 0, or 

(b) a /c' G Ao, such that Xk'j > 0. 


If both (a) and (b) are violated then Xj is a solitary separator and leads to a contradiction. 
Furthermore, for such x*/or it contains at least one non-zero entry. This is 

because if j ^ 1, i.e., Xj does not correspond to the intercept (the column of all one’s), 
then the hrst entry in or equals 1. If j = 1, due to the assumption that X 

has a column rank > 1, there exists one row G {1,2,..., n} such that contains at 


least one non-zero entry. If G Aq, then we let k' = and the condition (b) holds because 
Xjo 1 = 1 > 0. If G y4i, we may hrst rescale Xi by —1, which transforms /3i to —/3i. Since 
the Cauchy prior centered at zero is invariant to this rescaling, and \ y) exists if and 

only if E{l3i \ y) exists, we can just apply this rescaling, after which = —1 < 0. Then 

we let i' = E and the condition holds. 


We hrst assume that condition (a) is true. We dehne a positive constant e = |xi/jj/2 = 
-Xi/j/2. For any jSj > 0, we dehne a subset of the domain of (3{-j) as follows 


G(ft) s {A-,) e 


° ‘ < A} ■ 


(B.2) 


Then for any /3(_j) G G{l3j), = Xi/j/Sj + x^ < (xj/j -F e)l3j = -e/Sj. Therefore, 


/i(x^/3) < for all /3(_j) G 


(B.3) 


As /i(-) and /o(-) are bounded above by 1, the likelihood function p{y \ (3) in (A.2) is 
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bounded above by /i(x^/3). Thus 


m I y)-" = / 


p(y I P)p{P{-j))dp{-j) 


d(3j 


< / ldjP{(dj) 


I3jp{f3j 


< / I3jp{l3j 

Jo 


/i(x^/3)p(/3(_^))d/3(_j) 


djJj 


/ /i(x^/3M/3(-j))c?/3(-i) + / /i(x^/3)p(/3(_j))d/3(_j) 

/g(/ 3 ^) dMP-i\G(/ 3 ^) 


/ /i(-e/5j)p(/3(-j))d/3(_j) + / p(/3(_j))d/3(_j) 

/g(/3j) dRp-i\G(/3j) 


d/3j 


d(3j 

(B.4) 


Here the last inequality results from (B.3) and the fact that the function /i(-) is bounded 


above by 1. An upper bound is obtained for the hrst term in the bracket in (B.4) using the 
fact that the integrand is non-negative as follows: 


'GWP 






= /i(-e/3i) / PiP{-j))dP{-j) = , " .g. < e (B.5) 


1 -f 


—V-* 

= 1 


Recall that Xj/contains at least one non-zero entry. We assume that Xj/^^ 7^ 0. Then to 


simplify the second term in the bracket in (B.4), we transform /3(-j) to via a linear 

transformation, such that p = and ^ is the vector (3(-j) after excluding fdr- The 

characteristic function of a Cauchy distribution is ip{t) = where t G M. 

Since a priori, r; is a linear combination of independent (7(0, random variables, for 
1 < £ < p, £ 7 ^ j, its characteristic function is 


So the induced prior of p is ^et |xj/^(_j)| denote the vector ob¬ 

tained by taking absolute values of each element of Xj/^(_j), then the above scale parameter 
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Y.i<e<p/^j\^i'/Wi= % ( |B-2D , for any /3(_j) ^ G{(3j), the corresponding 


rj > e(3j and ^ G 


0-2 


. An npper bonnd is calculated for the second term in the bracket 


in (B.4). Note that p{rj)p{^ \ p) is the joint density of p and Since it incorporates the 
Jacobian of transformation from f3{-j) to p and a separate Jacobian term is not needed 
in the hrst equality below. 


AG(/3d 




p{p)p{^ I p)d^dp 


€/3j JRP- 


/ rp -2 P{$, 

p)d^ 

71 



1 + hV (|x*',(-i) 

^^(-3))\ 


dp 


1 

TT 


TT 

- arctan 

2 


€(3j 


IT 


- ^ irctin ( \ ^ ^i-j) 


(B,6) 


Here, the second equality holds because fjg^p- 2 p(^ | p)d^ = 1; the last inequality holds 
because e and (3j are both positive, and for any f > 0, arctan(f) < t. 

Then substituting the expression for p{/3j) as in (|^, we continue with (B.4) to hnd an 
upper bound. 


m I y)^ < 




'o 7rcTj(l + . 


^(-j) 


< 


^—dHj + 
1 vra,- 


nejSj 
■i'A-j) I ^{- 3 ) 


d(3j 


f*oo 1^ \T ^ 

Xj/ ( — 


'0 7rVje(l + 


jp _ ^ I i)l 3 ) /p 

2 I^2\^P3 - _ 9 -Ni:-• iB.Yj 


TTa^e^ 


27re 


On the other hand, if condition (b)| holds, then we just need to slightly modify the above 


proof. We dehne e = \xk'j\/2 = Xk',j/2, and change (B.2) to 


G(ft) {/3,_„ e 


D-1 


: X 


k' 




Consequently, the terms fi{xJ,/3) and fi{—ef3j) in ( B.4[ ) have to be changed to /o(x|’,/3) and 
fo{e(3j), respectively. For the logit link, fo{e(3j) = fi{—e(3j). The range of the integral in 
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(B.6) with respect to rj is from —cxd to however, because the density of rj is symmetric 


around 0, the value of the integral stays the same. So it can be shown an upper bound for 
I y)+ is [1/ (vra^e^)] + (27re)]. 

We now show that the negative term E{l3j | y)” has a finite lower bound. For 
any [3j <0, by expressing (3* = —(3j, we need to show that the positive term —E[(3j | y)“ = 
~/°oo I ~ l^jP{f^j)piy I has a hnite upper bound. As the 

idea is very similar to the proof of existence of E{f3j \ y)+, we present less details here. 

Since the predictor is not a solitary separator, there exists either 

(c) an i* G Ai such that Xi*j > 0, or 


(d) a k* G Ao, such that Xk*,j < 0. 


If 


and (d) are both violated Xj has to be a solitary separator, which leads to a contra¬ 


diction. WLOG, we assume that condition (c) is true, and as before must contain 


at least one non-zero entry, say, Xj* * ^ 0. If condition (d) is true, then we can adopt a 


modification similar to the one that is used to prove the existence under condition (b) based 
on the proof under condition (j 


We dehne a positive constant e = Xi*j/2. For any f3* > 0, we define a subset of ^ 


0-1 


as 


GWj) {A-j) e ^ : x^ < e/?*|. Then for all/3(_j) G yij,f3 = -Xi*j/3* + 

x^ + e)(3* = hence /i(x^/3) < fi (—e/?*). Since the likelihood 

function p{y \ -/3*,/3(_j)) < /i(xi* j(-/3*) -F xf* < 1, 


m I y)’ = / Ppm 


p{y I -A,A-t))p(A-i))^A-i) 


dm 


< 


ppm 


/ /i(-eA)p(/3(_j))d/3(_j) + / p(/3(_j))d/3(_j) 


dm. (B.8) 


The hrst term in the bracket in (B.8) has an upper bound exp(—e/?*) as in (B.5). Recall 


that Xj*^s 7^ 0. We now transform /3(-j) to {ri,$,) via a linear transformation, such that r] = 
* (_j)/3(-j) and ^ is the vector (3{-j) after excluding (3s. The prior of rj is (7(0, |xj*_(_j)|'^cr(_j)). 


T 

X. 
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For any /3(_j) ^ the corresponding r] > e/3*. Therefore as in (B.6), we obtain an 

npper bonnd for the second term in the bracket in (|B.8) as |xj* (vre/?*). Finally, 


following (B.7) an npper bonnd for —E{(3j \ y) is [1/ (vrcTje^)] + [|xi*,(-j)|^cr(-^-)/(27re)]. □ 


C Proof of Theorem [2] 

Proof. Following the proof of Theorem we denote the snccess probability fnnction by 
TT = /i(x^/3), where /i(-) is the inverse link fnnction, i.e., /i(-) = Similarly, let the 

failnre probability fnnction be /o(x^/3) = 1 — /i(x^/3). Note that the proof of the necessary 
condition for Theorem given in Appendix only relies on the fact the /i(-) is increasing, 
continnous, and bonnded between 0 and 1. Since the link fnnction g{-) is assnmed to be 
strictly increasing and differentiable, so is /i(-)- Moreover, the range of fi is (0,1). Therefore, 
the proof of the necessary condition for Theorem follows immediately. 

For the proof of the snfficient condition in Theorem one can follow the proof in Ap¬ 
pendix]^ and proceed with the specihc choice of e used there, when condition holds. The 


proof is identical until (B.4) because the explicit form of the inverse link function is not used 


until that step. We re-write the hnal step in (B.4) below and proceed from there: 


m I y)"" 

POO 

< / ^jpwj) 


/ /i(-e/3j)p(/3(_j))d/3(_j) + / p(/3(_j))d/3(_j) 

'G(/3P 3rp-i\G(/3^) 


df3j 




'G(/3d 




djdj 




poo 

/ Pif^i-j)W{-j) 

Jo 

3rp-i\G(/3P 


d(3j 


< l3jPil3j) fii-el3j) dl3j + 

=9“T-c/3i) 


/rp-i\G(/3j) 




d^j. (C.l) 


43 






















The sufficient condition in Theorem states that for every positive constant e, 


PjP{Pj)9 {-^Pj)dPj < oo. 


This implies that the integral will be bounded for the specihc choice of e used in the above 


proof, and hence the first integral in (C.l) is bounded above. The second integral does 


not depend on the link function and its bound can be obtained exactly as in Appendix 
Thus E{l3j I y)"*" < oo under condition (a). On the other hand if condition (b) holds, the 
proof follows similarly as in Appendix and now we need to use the sufficient condition in 
Theorem that for every positive constant e, (3 jp{Pj) [1 — g~^{e(3j)] d(3j < oo. A bound 

for —E{(3j I y)“ can be obtained similarly, which completes the proof. 

In probit regression, we first show that 


Vobit(^) = < e /(I + e ) = for any t < 0, 


(C.2) 


where <h(f) is the standard normal cdf. It is equivalent to show that the difference function 


u{t) = <h(f) 


1 + e' 


< 0, for all t < 0. 


(C.3) 


Note that m( 0) = 1/2 — 1/2 = 0, and hmt^_oo'u(t) = 0. Since u(t) is differentiable, we have 


// \ 1 —A 6 * 

u (t) = 2 — ^ 

(l + e‘)2 


Now m'( 0) = l/\/^ - 1/4 > 0, and when t is very small, u'{t) < 0 since e decays to 
zero at a faster speed than e*, i.e., there exists a f < 0 such that u'(t) < 0. Since u'(t) is a 
continuous function, the intermediate value theorem applied to [t, 0] shows that there exists 


a fi < 0 such that u'{ti) = 0. Therefore, to show (C.3), it is sufficient to show that ■u'(f) has 


a unique root on M , which is proved by contradiction as follows. 
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If u'{t) has two distinct roots ti,t 2 < 0, i.e., for i = 1,2, u'{ti) = 0, then 


1 

e 2 


o^i 


(*2 + 1)^ 
e 2 

(n+i)^ 
e 2 


(1 + e **)2 

1 + ^ ^ 
1 + 


, z = l,2^ 

. . (^2 + 1)^ 


— h. 
e 2 

_l2 

e 2 


e*^ (1 + e*2)^ 
e*2 (1 + e*i)^ 


- log(l + e*") = ^ - log(l + e*^). 


(C.4) 


Note that the derivative of the function {t + 1)^/4 — log(l + e*) is {t + l)/2 — eV(l + e*). 
It is straightforward to show that this derivative is strictly less than 0 for all f < 0, so 


(t + 1)^/4 — log(l + e*) is a strictly decreasing function. Thus (C.4) holds only if ti = t 2 , 
which leads to a contradiction. 

Hence for any e > 0, 




A-e- 




lo 7rcrj(l+ ^ 


d/3j < I - djdj = 


< oo. 




Tra.-e^ 


Since the probit link is symmetric, i.e., 1 — ~ ^ ~ 4>(e/dj) = ^(—€/3j) = 

£/probit(-e/5i), we also have “ £/probit(e/5i)] ^ 


D Proof of Corollary 

To prove Corollary [TJ we mainly use a similar strategy to the proof of Theorem To show 
the necessary condition, we can use all of Appendix [^without modification, for both logistic 
and probit regression models. To show the sufficient condition, we can follow the same proof 
outline as in Appendix with some minimal modification as described in the following 
proof. 

Proof. First, we denote the vector of prior location parameters by = (/ii,/i 2 , • • •, Pp)'^■ If 
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we shift the coefficients /3 by units, then 


def ^ 

7 = P-M 


7 , ~ C(0,a,), 


j = 1 , 2 , 


that is, the resulting parameters 7 j have independent Cauchy priors with location parameters 
being zero. Since the original parameter jSj = 7 j + /ij for each j = 1 , 2 ,... ,p, the existence 
of E{l3j I Y) is equivalent to the existence of E{pfj | Y). So we just need to show that if X^ 
is not a solitary separator, then i?( 7 j | Y) exists. For simplicity, here we just show that the 
positive term 


I y)^ = I ijpiij)p{y I ij)dyj 


has a hnite upper bound. The other half of the result that the negative term E{'yj \ y)~ has 
a hnite lower bound will follow with a similar derivation. 

As in Appendix we hrst assume that condition (a) is true, and dehne e in the same 
way. For any 7 ^ > 0, we dehne a subset of the domain of 7 (-j) as follows 


Ghj) {7(-i) e MJ’ ^ < e7j} , 


then for any 7 (-j) G G^jj), x ^7 < —ey^-. Since /i(-) is an increasing function. 


/i(x^/3) = /i(x^7 + < fii-eyj + xj,n), for all 7(_7 G 


A similar derivation to (B.4) gives 


Eilj I y)’ 


< / 7iP(7i) 

Jo 


/ /i(-e7j+x^^)p(7(-i))c?7(-i) + / p(7(-i))c?7(-j) 

'G(7d 7MP-i\G(7i) 


d7j. 
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where by (B.5) the hrst integral inside the bracket has an upper bound 


'G(7d 


fli-e-fj + x^^)p(7(_j))d7(_j) < /i(-e7j + x^^), 


(D.l) 


and by (B. 6 ) the second integral inside the bracket also has an upper bound 


p(7(-i))rf7(-i) < -—-^ 

P-7G(7d 


In logistic regression, the right hand side of (D.l) is further bounded 


fli-e-fj + x,,^) = < e 

1 + e 


and hence by (B.7), 


Ehj I y)+ < 


„:xT,u |„ \T „ 

+ ^ ® *_I ^(-j) 

o ~ 1 ~ 


7r(j,e^ 


27re 


In probit regression, the function /i(-) in the above derivations equals the standard 
normal cdf $(■)• By (C.2), for any 7 ^ > x^^/e, we have 


„-e7j+x^/i 

<h {-e'fj + x^^i) <- 
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Hence for x^^/e > 0 we have an npper bonnd 


Ehj I y)^ < / ijPiij) 


<h {-e-fj + + 


X,-/ 


^{-3) 


ire-fj 


dlj 


IjPi'lj)^ {-Hj + d-fj + 


27re 


rxf,fi/e 


'0 


IjPbj)^ (-e7i + djj + / 7iP(7i)*^’ (-e7i + x^m) c?7i + 


X,-/ 


»',( -i)l "(-j) 
2'Ke 


foo -N H/t 

< / 7jP{lj)d7j + / 7iP(7i)e"'^^+^?''d7i + 

Jo JxJmA ^Tre 


< 


/o 7raj(l + 72 /a|) ^ 


d'^j + 




1 


1 + 


1 + 


T \ 2 

X-/M 

T \ 2 

X-/M 
ecT,' 


d7j + 


^0 7raj(l+ 7|/a|) ^ 

oo 


27re 


+ e* 




TTCr,- 


27re 


T 


_l_ ^ """ _|_ ]^*',(-i) I ^(-i) 


vraoe^ 


27re 


Note that a similar derivation also holds if x^^/e < 0. 


On the other hand, if condition (b) is trne, we can follow the same modification in 
Appendix to find npper bonnds in a similar way. □ 

To show Theorem]^ we decompose its proof into two parts: in Appendixwe show that 
all posterior means exist if there is no separation; then in Appendix we show that nnder a 
mnltivariate Cauchy prior, none of the posterior means exist if there is complete separation. 


E Proof of Theorem in the case of no separation 

Proof. For any j = 1,2, ...,p, to show that E[(3j \ y) exists, we just need to show the 


positive term E^fij \ y)"^ in (B.l) has an upper bound, because the negative term E^fij \ y)’ 


in (B.l) having a lower bound follows a similar derivation. 


When working on F^(/3j | y)^, we only need to consider positive (dj. Denote a new p — 1 
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dimensional variable 7 = f3[-j)/(3j, then for i = 1 , 2 ,..., n, 


"f3 = (5j {xij + . 


If there is no separation, for any 7 G there exists at least one i G {1, 2, 

that 


X, 


j + < 0) a i ^ Ai, or Xij + >0, if i G Aq. 


For each i = 1, 2,..., n, denote the vector Zj and the function /ij(-) as follows, 


, , , Xj if i G Ai 

def J U ^ I 

Z — — ' ^ 


, hi{7) = Zij + (_^.)7, 


-X,' if i G An 


, n}, such 


(E.l) 


(E.2) 


then (E.l) can be rewritten as hi{'y) < 0. Denote for i = 1, 2,..., n. 


Bi = {7 : hi{j) < 0 }. 


Then each Bi is a non-empty subset of unless Zij > 0 and = 0. Let X = {z : 

Bi 7 ^ 0 } denote the set of indices i, for which the corresponding Bi are non-empty. Because 
there is no separation, 

IJb, = MP-b (E.3) 


iei 


Hence, the set X is non-empty. We denote its size by q |X|, and rewrite X = { 7 , Z 2 , • • •, iq}- 
Now we show that there exist positive constants ,..., e* , such that 


LJs 

k=l 


0-1 


(E.4) 


where 


Bi ^ =^{7 : ^4(7) < - 7 j , 


(E.5) 
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are subsets of the corresponding Bi^,, for all fc = 1 , 2 ,..., g. 


If there exists an G X such that < 0 and = 0, then In this 

case, we just need to let and = M, for all r k, where M is an arbitrary- 


positive number. Under this choice of eds, the sets Sj’s dehned by (E.5) satisfy (E.4). 


If, on the other hand, Zj^. ^ 0 for all ik G X, i.e., all are open half spaces in 


D-l 


then we can End the constants eij,ei 2 , ■ ■ ■, Q, sequentially. For U, if Ufc= 2 ~ ^ 


can set = M. Then the resulting dehned by (E.5) satishes 


B,^ U U 5^3 U ■ ■ ■ U Bi^ = RP-h 


(E. 6 ) 


If 1 JL 2^4 7 ^ ^ then (E.3) suggests 


u^'d =n^f.- 


(E.7) 


Kk=2 


k=2 


For (E. 6 ) to hold, we just need to End an positive eij such that the resulting has nL2Bi 
as a subset, i.e., —should be larger than the maximum of hi^{'^) over the polyhedral region 
7 G fj ^^2 Note that maximizing hj^( 7 ) over the polyhedron can be represented as a 
linear programming question. 


maximize Zi^^j + (E. 8 ) 

subject to zj _(_^.)7 > -Zi^^j 




By Bertsimas and Tsitsiklis (1997| pp. 67, Corollary 2.3), for any linear programming 


problem over a non-empty polyhedron, including the one in (E. 8 ) to maximize hi^{^) = 
+ z^ (-j)T) either the optimal hi^i^) = oo, or there exists an optimal solution, 7 *. Here, 
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the latter case always occurs, because by (E.7), the maximum of hi^{^) over the polyhedron 
Cfk =2 ^ik exceed zero, so it does not go to inhnity. Hence, we just need to let 


e*! 2 


max Zi^j + z. 




so that the resulting = {7 : hi^^j) < —e,^} contains C\l= 2 ^ik ^ subset, which yields 


(E.6). After Ending 6*^, we can apply similar procednres sequentially, to find positive valnes 
Cif,, for k = 2,3,... ,q, snch that 


B,^^^ U---UBi^= RP-h 


After identifying all e^’s, the resulting Bi^^s satisfy (E.4). Note that the choice of ejj,’s only 


depend on the data Zj, i = 1, 2 ,..., n, so they are constants given the observed data. 

For each k = 1,2,... ,q, next we show that for any 7 G B^^, the likelihood fnnction of 
the Zfcth observation is bonnded above by This is becanse in a logistic regression, 

if ik E Ai, then 


pivik I f^jn) = fi WjKin)) = -—< 


. , , , ^ (E.9) 


if ik E Aq, then 


PiVik I Pjn) = fo i-Pjhikh)) = 


1 + Q-Pjhijl) 


< 




PjeikC 


(E.IO) 


Here, the last ineqnality holds because e ^ ^ for any f > 0. By (C.2), in a probit 


regression model, the ineqnalities (E.9) and (E.IO) also hold. 
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Now we show that the positive term E{(3j \ y)’*' has a hnite upper bound. 


EWj I y)^ = / Pj / p{y I /3j,-f)p{/3j,-f)d'ydPj 

Jo JRP-I 

poo 9 p 

- / /- P(y I 


< 


L \ /3jn)p{f3jn)dlfd(3j 
1 


“'0 k=l 

/*O 0 ^ /» 

< ftE ^ 

A:=l 
9 


0 


p{/3j,l)d-rd/3j 


^ -1 ^OO p 

= Y~ - Pildpl)dldPj 

k=l dBi^ 

1 f°° f 1 

-Y~ / p{Ppl)dldp, < ~ 


ZT ^4® ^0 


fe=l 


k=l 


□ 

Note that in Appendix the specihc formula of the prior density of (3 is not used. 
Therefore, if there is no separation in logistic or probit regression, posterior means of all 
coefficients exist under all proper prior distributions. 


F Proof of Theorem in the case of complete separation 

Proof. Here we show that if there is complete separation, then none of the posterior means 


E{/3j I y) exist, for j = 1,2,... ,p. Using the notation Zj, defined in (E.2), we rewrite the 
set of all vectors satisfying the complete separation condition ([^ as 


C = n {/3 € R” : zf/3 > 0} 


2=1 


According to Albert and Anderson (1984), C is a convex cone; moreover, if /3 G C, then 


f3 + 5 E C for any 6 eM.^ that is small enough. Hence, the open set C, as a subset of the MP 
Euclidean space, has positive Lebesgue measure. 
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To show that -E(/3j | y) does not exist, if C projects on the positive half of the axis, 
we will show that | y)^ diverges, otherwise, we will show that \ y)“ diverges (if 
both, then showing either is sufficient). Now we assume that the former is true, and denote 
the intersection 

C+ = Cn{f3 eW : /3j > 0}, 

which is again an open convex cone. Since Cd has positive measure in MP, under the change 
of variable from {(3j, f3(_j-)) to where /3(-j) = /5j7, there exists an open set G 

such that can be written as 

C/ = > 0,7 e C+}. 


Suppose that 7 can be written as ( 71 ,..., 7j-i, Ij+i, ■ ■ ■, 7p)^- We dehne a variant of it by 
7 ( 71 , ..., 7j-i, 1, 7j+i, • • •, such that /3 = jSj'j. Under the multivariate Cauchy prior 

(|^, the induced prior distribution of is 


p(/3j,7) oc - 




.p-i 


1 + Wjl - S-i - fi) 


p+i 




,p-i 


[(■7^E-i-7) P] - 2 (7^S-V) ft + + 1)] 


P+1 


Inside Cd there must exist a closed rectangular box, denoted by V'^ = {7 : 7 *, G 


[4,Mfc],fc = 1,... ,j - 1, j + 1,... ,p} C C+. By 


Browder 


(1996, pp. 142, Corollary 6.57), 


a continuous function takes its maximum and minimum on a compact set. Since is a 
compact set (closed and bounded in M^“^), 


a = max 7 ^S-^ 7 , 
7eb+ 


b = min 7 ^S 

7eb+ 


(F.l) 


both exist. 
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Recall that for all 7 G (hence including all elements in Zi^j + > 0; 

equivalently, if i G Ai, then Xjj + > 0, and if i G Aq, then Xij + xJ’(-_^.j 7 < 0. Now 

we show that in both logistic and probit regressions, the positive term E{f3j | y)+ diverges. 


E{Pj I y)^ = / / p(/^i,7)p(y I f3j,j)djd(3j 

Jo JMP-i 

poo p 

> / Pj I pi/dj,l) n +^f(-i)^)) n foi/dji^kj +^l,(-j)j))d'ydPj 


'0 Jv 

r-oo 


i&Ai 


k&Ao 


> 


If /to) n fo{ 0 )d'ydPj 

i&Ai k&Ao 

poo p 

/ Idj I pWj,o)dodPj 


>0 Jv^ 


/S: 


,p-i 


= C p, - 

Jo Jv+ [( 7 'rs-i 7 ) P] - 2 (7'^S-V) Pj + (r'^S-V + 1 )] 


-d'jdPj 


p+i “ I 
2 


> C 


= c 




0 JPt [a/3] - 2b/3, + V + 1)] 

-p-i 


-d'yd/dj 


p+i “ I 
2 


(//fc ^fc) 


fd- 


k=l 


£+1 d(3j 
Jo [a{(3j - b/aY + c] 2 


(F.2) 


= CXD, 


where C is a positive constant, c is a constant, and a and b have been defined previously in 

(|E3- 

On the other hand, if the set C of complete separation vectors only projects on the 
negative half of the (3j axis, following a similar deviation, we can show that E{(3j \ y)~ 
diverges to — 00 . 

□ 
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